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Abstract 

This article describes posterior maximization 
for topic models, identifying computational 
and conceptual gains from inference under 
a non-standard parametrization. We then 
show that fitted parameters can be used as 
the basis for a novel approach to marginal 
likelihood estimation, via block-diagonal ap- 
proximation to the information matrix, that 
facilitates choosing the number of latent top- 
ics. This likelihood-based model selection is 
complemented with a goodness-of-fit analysis 
built around estimated residual dispersion. 
Examples are provided to illustrate model se- 
lection as well as to compare our estimation 
against standard alternative techniques. 



1 Introduction 

A topic model represents multivariate count data as 
multinomial observations parameterized by a weighted 
sum of latent topics. With each observation G 
{xi . . .x^} a vector of counts in p categories, given 
total count rrii = x^.-, the i^T-topic model has 



X,; 



-uJiKOK.rrii) 



(1) 



where topics Ok = [Oki • - • OkpY and weights u^i are 
probability vectors. The topic label is due to appli- 
cation of the model in ([T]) to the field of text anal- 
ysis. In this context, each x^ is a vector of counts 
for terms (words or phrases) in a document with total 
term-count m^, and each topic ^/e is a vector of prob- 
abilities over words. Documents are thus character- 
ized through a mixed-membership weighting of topic 
factors and, with K far smaller than p, each uji is a 
reduced dimension summary for x^. 



The author is supported by a Neubauer family faculty fel- 
lowship at Chicago Booth. Matthew Gentzkow and Jesse 
Shapiro are to thank for extensive discussion and feedback 
on this material. 



Section [2] surveys the wide use of topic models and em- 
phasizes some aspects of current technology that show 
room for improvement. Sect ion [2] 1 describes how com- 
mon large-data estimation for topics is based on max- 
imization of an approximation to the marginal likeli- 
hood, p(X|0). This involves very high-dimensional 
latent variable augmentation, which complicates and 
raises the cost of computation, and independence as- 
sumptions in approximation that can potentially bias 
estimation. Moreover, Section |2j2 reviews the very 
limited literature on topic selection, arguing the need 
of new methodology for choosing K. 

The two major objectives of this article are thus to fa- 
cilitate an efficient alternative for estimation of topic 
models, and to provide a default method for model 
selection. In the first case. Section |3] develops a 
framework for joint posterior maximization over both 
topics and weights: given the parameterization de- 
scribed in [3)1, we outline a block-relaxation algorithm 
in [3] 2 that augments expectation-maximization with 
quadratic programming for each u?^. Section [4] then 
presents two possible metrics for model choice: |4jl 
describes marginal data likelihood estimation through 
block-diagonal approximation to the information ma- 
trix, while [4] 2 proposes estimation for residual disper- 
sion. We provide simulation and data examples in Sec- 
tion 5 to support and illustrate our methods, and close 
with a short discussion in Section [6l 

2 Background 



The original text-motivated topic model is due to' Hof-| 
(1999), who describes its mixed- membership 



mann 



likelihood as a probability model for the latent seman- 



tic indexing of Deerwester et al. (1990). Blei et al. 



( 2003 ) then introduce the contemporary Bayesian for- 
mulation of topic models as latent Dirichlet alloca- 
tion (LDA) by adding conditionally conjugate Dirich- 
let priors for topics and weights. This basic model has 
proven hugely popular, and extensions include hierar- 
chical formulations to account for an unknown num- 
ber of topics ( |Teh et al. 2006, using Dirichlet pro- 
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cesses), topics that change in time (Blei and Lafferty 
2006 ) or whose expression is correlated ("Blei and Laf- 



ferty 



2007), and topics driven by sentiment (Blei and 



McAuliffe 20101). Equivalent likelihood models, un- 



der both classical and Bayesian formulation, have also 
been independently developed in genetics for analysis 



of population admixtures (e.g., Pritchard et al. 2000). 



2.1 Estimation Techniques 

This article, as in most text-mining applications, fo- 
cuses on a Bayesian specification of the model in ([T]) 
with independent priors for each 6^ and cj^, yielding 
a posterior distribution proportional to 



K 



p(0, 1^, X) = MN(x,; 0u;„ m,)p(u;,) p(^fc). 

i=l k=l 

(2) 

Posterior approximations rely on augmentation with 
topic membership for individual terms: assume term 
/ from document i has been drawn with probability 
given by topic Oz^i^ where membership zu is sampled 
from the K available according to cj^, and write as 
the m^-length indicator vector for each document i and 
Z = {zi, . . . , Zn} as the full latent parameter matrix. 

Gibbs sampling for this specification is described in the 



original genetic admixture paper by [Pritchard et al. 
(|2000|) and b y some in machine learning (e.g.,lGriffiths 
and Styvers 2004). Many software packages adapt 
Gibbs to large-data settings by using a single poste- 
rior draw of Z for parameter estimation. This only 
requires running the Markov chain until it has reached 
its stationary distribution, instead of the full conver- 
gence assumed in mean estimation, but the estimates 
are not optimal or consistent in any rigorous sense. 

It is more common in practice to see variational 



posterior approximations (Wainwright and Jordan 



2008), wherein some tractable distribution is fit to 



minimize its distance from the unknown true poste- 
rior. For example, consider the conditional posterior 
p(ri, Z I 0,X) and variational distribution q(ll, Z) = 
nr=i [Dir(cc;i; s,) YlT=i MN(^.z; vu)]. Kullback-Leibler 
divergence between these densities is 



= logp(X|0) 

-{E, [logp(f2,Z,X| e) 



(3) 



[logq(f2,Z)]}. 



which is minimized by maximizing the lower bound 
on marginal likelihood for 0, Eg [logp(ll, Z, X | 0)] — 
Eg [log q(ri, Z)], as a function of q's tuning parameters. 
Since iVi AL cji for i ^ I conditional on 0, q's main 
relaxation is assumed independence of Z. 



The objective in ([3| is propos ed in the original 
LDA paper by [Blei et al.| ( |2003[ ) p1 The most com- 
mon approach to estimation (e.g., [Blei and Lafferty[ 
|2007| Grimmer , 2010[ and examples in |Blei et al.| 
'2003') is then to maximize Eg [log p(ri, Z, X | 0)] given 
q(r2,Z) ^ p(ri, Z I 0,X). This mean-field estimation 
can be motivated as fitting to maximize the im- 
plied lower bound on p(X | 0) from ([3|, and thus 
provides approximate marginal maximum likelihood 
estimation. The procedure is customarily identified 
with its implementation as a variational EM (VEM) 
algorithm, which iteratively alternates between con- 
ditional minimization of ([3| and maximization of the 
bound on p(X | 0). 

A second strategy, full variational Bayes, constructs a 
joint distribution through multiplication of q(ri, Z) by 
q(0) = n/c=i Dir(^/c I u/e) and minimizes KL diver- 
gence against the full posterior. However, since cross- 
topic posterior correlation coT{0kj^0hj) does not dis- 
appear asymptoticall}{^ the independence assumption 
of q(0) risks inconsistency for and unstable finite 
sample results (e.g., Teh et alT| |2006[ ) . 



Our proposed approach is distinct from the above in 
seeking joint MAP solution for both and ft (i.e., 
that which maximizes ([2|), thus altogether avoiding 
posterior approximation. The estimation methodol- 
ogy of Section |3] is more closely connected to maxi- 
mum likelihood estimation (MLE) techniques from the 
non-Bayesian literature on topic models: the EM al- 
gorithm, as used extensively for genetics admixture 
estimation (e.g., |Tang et al. , 2006) and in Hoffman's 
1999 text-analysis work, and quadratic programming 
as applied by Alexander et al. (2009) in a fast block- 
relaxation routine. We borrow from both strategies. 

Comparison between MAP and VEM estimation is 
simple: the former finds jointly optimal profile esti- 
mates for and 11, while the latter yields inference 
for that is approximately integrated over uncer- 
tainty about ft and Z. There are clear advantages 
to integrating over nuisance parameters (e.g., Berger 



[et all [1999)), but marginalization comes at the ex- 
pense of introducing a very high dimensional latent 
parameter (Z) and its posterior approximation. This 
leads to algorithms that are not scaleable in document 
length, potentially with higher variance estimation or 
unknown bias. We will find that, given care in param- 
eterization, exact joint parameter estimation can be 
superior to approximate marginal inference. 



Teh et al. (2006) describe alternative approximation 
for the marginal indicator posterior, p(Z | X); this avoids 
conditioning on 0, but keeps independence assumptions 
over Z that will be even less accurate than they are in the 
conditional posterior. 

^See the information matrix injJjl. 
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2.2 Choosing the Number of Topics 

To learn the number of topics from data, the hterature 
makes use of three general tools: cross-validation, non- 
parametric mixture priors, and marginal likelihood. 



See Airoldi et al. (2010) for a short survey and compar- 



ison. Cross-validation (CV) is by far the most common 
choice (e.g. Grimmerj 2010| Griin and Hornik 2011). 
Unfortunately, due to the repeated model-fitting re- 
quired for out-of-sample prediction, CV is not a large- 
scale method. It also lacks easy interpretability in 
terms of statistical evidence, or even in terms of sam- 

7.12). For a model-based 



pie error (Hastie et al. 



alternative, Teh et al. 



2009, 

p006) write LDA as a Hierar- 



chical Dirichlet process with each document's weight- 
ing over topics a probability vector of infinite length. 
This removes the need to choose but estimation 
can be sensitive to the level of finite truncation for 
these prior processes and will always require inference 
about high-dimensional term-topic memberships. Fi- 
nally, the standard Bayesian solution is to maximize 
the marginal model posterior. However, marginal like- 
lihood estimation in topic models has thus far been 
limited to very rough approximation, such as the av- 



erage of MCMC draws in (Griffiths and Styvers, 2004). 



3 Parameter Estimation 



Prior choice for latent topics can be contentious (Wal- 



lach et al. 2009) and concrete guidance is lacking in 



the literature. We consider simple independent priors, 
but estimation updates based on conditional indepen- 
dence make it straightforward to adapt for more com- 
plex schemes. Our default specification follows from a 
general preference for simplicity. Given K topics. 



lid 



iid 



Dir(l/K), z = l...n, 
Dir(a/ei, . . . k = l...K. 



(4) 



The single Dirichlet concentration parameter oi 1/K 
for each uj encourages sparsity in document weights 
by placing prior density at the edges of the parameter 
space. This specification is also appropriate for model 
selection: weight of prior evidence is constant for all 
values of and as K moves to infinity Dir(l/i^) ap- 



proaches the Dirichlet process ( Neal , 2000| ) . Topic pri- 
ors are left generic in (|4| to encourage flexibility but 
we default to the low concentration akj = l/{Kp). 

3.1 Natural Exponential Family 
Parameterization 

To improve estimation stability and efficiency, we pro- 
pose to solve for MAP estimates of and not in the 
original simplex space, but rather after transform into 



their natural exponential family (NEF) parameteriza- 
tion. For example, in the case of we seek the MAP 
estimate for $ = {(/p^^? • • • ? ^nl? where for a given cj. 



exp[(pfe- 



, with the fixed element (/?o = 0. 

(5) 

Hence, ignoring ifi^ in estimation, each (f^ is an un- 
restricted vector of length K — 1. Since dook/diph = 
tk=h^k — ^k^hi the Jacobian for this transformation 
is Diag[u;] — ujuj' and has determinant |Diag[ct?] |(1 — 
uj'Bmg[uj]uj) = Ylk=i ^k{l-Y^h=i ^h)' Hence, view- 
ing each (jji as a function of (^^, the conditional poste- 
rior for each individual document given becomes 

P(^i(^i) I Xi) (6) 

K-l / K-1 \ 

MN(x,; 0u;„ m,) c^^^ 1 - E ' 

k=l \ h=l / 



OC . 



and the NEF conditional MAP is equivalent to solu- 
tion for LV under a Dir(l/i^ + 1) prior. Similarly, our 
estimate for each 0^ corresponds to the simplex MAP 
under a Dir(a/e + 1) prior |^ 

The NEF transformation leads to conditional posterior 
functions that are everywhere concave, thus guarantee- 
ing a single conditional MAP solution for each oji given 
0. This introduces stability into our block relaxation 
algorithm of [3]2: without moving to NEF space, the 
prior in (|4| with 1/i^^ < 1 could lead to ill-defined max- 
imization problems at each iteration. Conditional pos- 
terior concavity also implies non-boundary estimates 
for ri, despite our use of sparsity encouraging priors, 
that facilitate Laplace approximation in Section [4jl. 

3.2 Joint Posterior Maximization 

We now detail joint MAP estimation for ft and 
under NEF parameterization. First, note that it 
is straightforward to build an EM algorithm around 
missing data arguments. In the MLE literature (e.g., 
Hofmann 1999 Tang et al. , 2006) authors use the 



full set of latent phrase- memberships (Z) to obtain 
a mixture model specification. However, a lower di- 
mensional strategy is based on only latent topic totals, 
such that each document z = 1 . . . n is expanded 



X^^MNpiOutn)- 



-MNpiOK^tiK). (7) 



where ~ MNx(c«^i, m^), with Ti . . . treated as 
missing-data. Given current estimates and 11, stan- 
dard EM calculations lead to approximate likelihood 



^NEF parameterization thus remove s the "-1 offset" 
for MAP estimation critiqued by Asuncion et aL| (|2009|) 
wherein they note that different topic model algorithms 
can be made to provide similar fits through prior-tuning. 
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bounds of MN (Xk; OkJk^ , with 



ik = "^Xkj, k = l...K. 

(8) 

Updates for NEF topic MAPs under our model are 
then Okj = {xkj + Q^/cj)/[4 + Ej^i (^kj]- 

EM algorithms - even the low dimensional version in 
(|8| - are slow to converge under large-p vocabularies. 
We advocate finding exact solutions for 11 | at each 
iteration, as this can speed-up convergence by several 
orders of magnitude. Factorization of the conditional 
posterior makes for fast parallel updates that, similar 



to the algorithm of Alexander et al.| ( ,2009} , solve in- 
dependently for each cc?^ through sequential quadratic 
programming]^ We also add first-order quasi-Newton 
acceleration for full-set updates (Lange 2010). 



Suppressing i, each document's conditional log poste- 
rior is proportional to 



K 



/(CJ) = log {uiOij + . . .^UJKOKj)^Yl 



log(cjfe) 



k=l 



K 



(9) 

subject to the constraints I'u) = 1 and Uk > for 
k = 1 . . . K. Gradient and curvature are then 



9k 



f^kh 



j = l J 



1 



KuJk 



(10) 



XjUkj^hj 



^[k=h] 



-hi 




■ A " 




g 


V 




A 




_ 



and Taylor approximation around current estimate u) 
yields the linear system 



(11) 



where = {uok—ook) and A is the Lagrange multiplier 
for the equality constraint. This provides the basis for 
an active set strategy ( Luenberger and |2008[ 12.3): 
given A solving (11), take maximum 5 G (0,1) such 
that 5/^k < 



-ujk V/c, and set uj = a> + (5A. If S < 1 and 
ujk hes at boundary of the feasible region (i.e., some 
tolerance from zero), we activate that constraint by 
removing Ak and solving the simpler system. 



Note from ( 10 ) that our log posterior in ([9|) is concave, 
thus guaranteeing a unique solution at every iteration. 
This would not be true but for the fact that we are 



^Alexander et al. also use this approach to update 
I in a similar model, but we have found in our appli- 
cations that any advantage over EM for is out-weighed 
by computational expense in the high-dimensional pivoting 
required by constraints X)j=i ^kj = 1. 



actually solving for conditional MAP cp^ rather than 
(jj. While the full joint posterior for transformed 
and ft obviously remains multi-modal (most authors 
use multiple starts to avoid minor modes; we initialize 
with a build from 2, . . . , topics by repeatedly fitting 
an extra topic to the residuals), the NEF parameteri- 
zation introduces helpful stability at each iteration. 

4 Model Selection 

In this section, we propose two techniques for inferring 
the number of topics. The first approach, in[4jl, is an 
efficient approximation for the fully Bayesian proce- 
dure of marginal posterior maximization. The second 
approach, in[4j2, consists of a basic analysis of resid- 
uals. Both methods require almost no computation 
beyond the parameter estimation of Section |3] 

4.1 Marginal Likelihood Maximization 

Bayesian model selection is founded on the marginal 



data likelihood (see Kass and Raftery, 1995), and in 
the absence of a null hypothesis scenario or an infor- 
mative model prior, we wish to find K to maximize 
p(X I K) = /p(0,l^,X I K)dF{e,n). It should 
be possible to find a maximizing argument simply by 
evaluating p(X | K) over possible values. However, as 
is often the case, the integral is intractable for topic 
models and must be approximated. 



One powerful approach is Laplace's method (Tierney 
[and Kadane^ ^1986) wherein, assuming that the pos- 
terior is highly peaked around its mode, the joint 
parameter-data likelihood is replaced with a close- 
matching and easily integrated Gaussian density. In 
particular, quadratic expansion of log[p(X, 0)] is 
exponentiated to yield the approximate posterior, 
N([$,0]; [^,0],H), where [^,0] is the joint MAP 
and H is the log posterior Hessian evaluated at this 
point. After scaling by i^! to account for label switch- 
ing, these approximations have proven effective for 



general mixtures (e.g. Roeder and Wasserman, 1997). 



As one of many advantages of NEF parameterization 



(Mackay 1998), we avoid boundary solutions where 
Laplace's approximation would be invalid. The inte- 
gral target is also lower dimensional, although we leave 
in simplex representation due to a denser and harder 
to approximate Hessian after NEF transform. Hence, 

p(X \K)^-p (X, e,n)\- U\-H2n)^K\ (12) 

where d = Kp -\- {K — l)n is model dimension and 
p(X, 0, Cl)^ = Ylti MN(x,; Soj,, m,)Dir(u;,; 1/K + 1) 
Yl^^^ Dii {6 k;cxk + !)• In practice, to account for 
weight sparsity, we replace d with Kp + dn where do 
is the number of uJik greater than 1/1000. 
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After parameter estimation, the only additional com- 4.2 Residuals and Dispersion 



putational burden of (12) is finding the determinant 



of negative H. Unfortunately, given the large p and 
n of text analysis, this burden will usually be signif- 
icant and unaffordable. Determinant approximation 
for marginal likelihoods is not uncommon; for exam- 
ple, Bayes information criterion (BIG) can be moti- 
vated by setting | — H| ^ n^|i|, with i the information 
matrix for a single observation. Although BIG conver- 
gence results do not apply under d that depends on n, 
efficient computation is possible through a more subtle 
approximation based on block-diagonal factorization. 

Writing L = log [p(X, 0, ft)] , diagonal blocks of H are lowing 



Another strategy is to consider the simple connection 
between number of topics and model fit: given theoret- 
ical multinomial dispersion of = 1, and conditional 
on the topic-model data generating process of ([T]), any 
fitted overdispersion > 1 indicates a true K that is 
larger than the number of estimated topics. 

Gonditional on estimated probabilities = 00;^, 
each document's fitted phrase counts are Xij = qijrrii. 
Dispersion can be derived from the relationship 
E \{xij — Xjj)'^] ^ (j^migij(l — qij) and estimated fol- 



Haberman 



H(=> = 



— Diag 



del. 



= = Diag 



(1973) as the mean of squared ad- 
justed residuals {xij — Xij)/sij^ where s'^j = miqij{l 
Qij). Sample dispersion is then a'^ = D/u^ where 



(13) 



D 



E 



'ZxijXij 



{ij: Xij>0} 



miqij{l - Qij) 



n p 



where O.j = [Oij 



Okj] is the j^^ row of 0. Here, 



dOkjdOhj 



E 



1 



while for each individual document's Lp^^ 

d(pikO(pih ^ ^ 



kj 



E^ 



kj — Qij I 
t[k=h]^ik V^ik^ih 1 2 



kj ^hj 



Qij 



with Qij — Ylk^i'^ik^kj' Finally, the sparse off- 
diagonal blocks of H have elements 



Qij 
(15) 

and V is an estimate for its degrees of freedom. We 
use V = N — d^ with N the number of Xij greater than 
1/100. D also has approximate distribution under 
the hypothesis that cr^ = 1, and a test of this against 
alternative > 1 provides a very rough measure for 
evidence in favor of a larger number of topics. 

5 Examples 

Our methods are all implemented in the textir pack- 
age for R. For comparison, we also consider R-package 



dOjkdifih 



^ik^ih , 



^hj 



Qij 



[k^h] 



implementations of VEM (topicmodels 0.1-1, Griin and 
Hornikl |2011i) and collapsed Gibbs sampling (Ida 1.3.1, 



Ghang 



|2011p [ 



Although in each case efficiency could 
be improved through techniques such as parallel pro- 
cessing or thresholding for sparsity (e.g., see the 
wherever Xij ^ 0, and zero otherwise. Ignoring these SparseLDA of |Yao et "al| |2009[ ),^ we seek to present 
cross curvature terms, an approximate determinant is 
available as the product of determinants for each di- 



agonal block in (13). That is, our marginal likelihood 



estimate is as in equation (12), but with replacement 



HI 



-H© 








n 



(14) 

This is fast and easy to calculate, and we show in Sec- 
tion |5] that it performs well in finite sample examples. 
Asymptotic results for block diagonal determinant ap- 
proximations are available in Ip sen and Lee] (2011), 
including convergence rates and higher-order expan- 
sions. From the statistician's perspective, very sparse 
off-diagonal blocks contain only terms related to co- 
variance between shared topic vectors and an individ- 
ual document's weights, and we expect that as n ^ oo 
the influence of these elements will disappear. 



a baseline comparison of basic algorithms. The pri- 
ors of Section [3] are used throughout, and times are 
reported for computation on a 3.2 GHz Mac Pro. 

The original topicmodels code measures convergence on 
proportional change in the log posterior bound, leading 
to observed absolute log posterior change of 50-100 at 
termination under tolerance of 10~^. Such premature 
convergence leads to very poor results, and the code 
(rida.c line 412) was altered to match textir in tracking 
absolute change. Both routines then stop on a toler- 
ance of 0.1. Gibbs samplers were run 5000 iterations, 
and Ida provides a single posterior draw for estimation. 

5.1 Simulation Study 

We consider data simulated from a ten-topic model 
withp = 1000 dimensional topics Ok ~ Dir(l/10), k = 
1, . . . , 10, where each of n = 500 documents are gener- 
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Figure 1: From simulation under various values of M, the expected document size, plot (a) shows average log 
Bayes factors for K = 5-15 against the null one-topic model, and (b) shows estimated dispersion for iiT = 9-11. 
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Figure 2: Topic estimation given K = 10 via MAP, VEM, and Gibbs procedures for data with M = 200. In 
(a), elements of each used in simulation are graphed against the corresponding estimates, and (b) shows the 
distributions for log MSE of these estimates and for algorithm computation time. 



ated with weights cji ~ Dir(l/10) and phrase counts 
~ MN(0u;^,mi) given ~ Po(M). This yields 
topics and topic weights that are dominated by a sub- 
set of relatively large probability, and we have set sam- 
ple size at half vocabulary dimension to reflect the 
common p ^ n setting of text analysis. Expected 
size, M, varies to illustrate the effect of increased in- 
formation in 5 sets of 50 runs for M from 100 to 1600. 

To investigate model selection, we flt K = 5 . . . 15 top- 
ics to each simulated corpus. Figure ^graphs results. 
Marginal likelihood estimates are in^a as log Bayes 
factors, calculated over the null model of = 1. Apart 
from the low information M = 100, mean p(X | K) is 
maximized at the true model of = 10; this was very 
consistant across individual simulations, with K = 10 



chosen invariably for M > 200. Even at M = 100, 
where K = S was the most commonly chosen model, 
the likelihood is relatively flat before quickly dropping 
for K > 10. In[T]b, the sample dispersion distribution 
is shown for = 9, 10, 11 at each size speciflcation. Of 
primary interest, is almost always larger than one 
for K < 10 and less than one for K > 10. This pat- 
tern persists for un-plotted and separation across 
models increases with M. Estimated dispersion does 
appear to be biased low by roughly 1-6% depending 
on size, illustrating the difficulty of choosing effective 
degrees of freedom. As a result, our test of a more- 
topics-than-i^ alternative leads to p-values of p = 
for K < 10 and p = 1 for > 10. 

We then consider MAP, VEM, and Gibbs estimation 
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Figure 3: Model selection metrics for each dataset over a range of possible K. In each case, the plotted points 
are log Bayes factor in multiples of 10^ and the dashed line is estimated dispersion. 
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Figure 4: Out-of-sample performance for 50 repetitions training on 80% of data and validating on the left-out 
20%, with predictive ^ - • Xij log(^^ 0;^) in multiples of 10^ and computation time in minutes. 



for the true K = 10 model with 50 datasets simulated 
at M = 200. To account for label- switching, estimated 
topics were matched with true topics by pairing the 
columns of least sum squared difference. Results are 
shown in Figure [2] The plots in[2ja show largely accu- 
rate estimation for both VEM and MAP procedures, 
but with near-zero occasionally fit at large 6 and 
vice versa. This occurs when some vocabulary prob- 
abilities are swapped in estimation across topics, an 
event that is not surprising under the weak identifi- 
cation of our high-dimensional latent variable model. 
It appears in |2]b that MAP does have slightly lower 
MSE, even though VEM takes at least 2-3 times longer 
to converge. Finally, as would be expected of estimates 
based on a single draw from a short MCMC run, Gibbs 
MSE are far larger than for either alternative. 

5.2 Data Analysis 



The data sets we consider are detailed in Taddy (2012) 
and included as examples in textir. WeSthere consists 
of counts for 2804 bigrams in 6175 online restaurant 
reviews, and Congress 1 09 wdiS compiled by^Gentzkowj 



'and Shapiro" ('2010') from the 109^^ US Congress as 529 
legislators' usage counts for each of 1000 bigrams and 
trigrams pre-selected for partisanship. 

Model selection results are in Figure [3] The marginal 
likelihood surfaces, again expressed as Bayes factors, 
are maximized at = 20 for the we8there data and at 
K = 12 for congress 109. Interestingly, dispersion esti- 
mates remain larger than one for these chosen models, 
and we do not approach = 1 even for K up to 
200. This indicates alternative sources of overdisper- 
sion beyond topic-clustering, such as correlation be- 
tween counts across phrases in a given document. 

Working with the likelihood maximizing topic mod- 
els, we then compare estimators by repeatedly fitting 
on a random 80% of data and calculating pre- 
dictive probability over the left-out 20%. In each 
case, new document phrase probabilities were calcu- 
lated as SuJi using the conditional MAP for under 
a Dir(l/i^) prior. Figure [4] presents results. As in 
simulation, the MAP algorithm appears to dominate 
VEM: predictive probability is higher for MAP in the 
we8there example and near identical across methods 



Topic Estimation and Selection 




1. dropout. prevention. program, american. force. radio, national. endowment. art, head. start, flood. insurance. program (0.12) 

2. near. earth. object, republic. cypru, winning. war. iraq, bless. america, troop. bring. home (0.12) 

3. near. retirement. age, commonly.prescribed.drug, repeal. death. tax, increase. taxe, medic. liability.crisi (0.12) 

4. va. health. care, united. airline. employe, security.private. account, private. account, issue. facing. american (0.11) 

5. southeast. texa, temporary. worker. program, guest. worker. program, million. illegal. immigrant, guest. worker (0.11) 

6. national. heritage. corridor, asian. pacific. american, Columbia. river. gorge, american. heritage. month (0.10) 

7. ready. mixed. concrete, driver. education, witness. testify, Indian. affair, president. announce (0.08) 

8. low. cost. reliable, wild. bird, suppli. natural. ga, arctic. wildlife. refuge, price. natural. ga (0.06) 

9. judicial. confirmation. process, fifth. circuit. court, chief .justice. rehnquist, summa.cum.laude, chief .justice (0.05) 

10. north. american. fre, american. fre. trade, change. heart. mind, financial. accounting. standard, central. american. fre (0.05) 

11. pluripotent. stem. eel, national. ad. campaign, cel. stem. eel, produce. stem. eel, embryonic. stem (0.04) 

12. able.buy.gun, deep. sea. coral, buy.gun, credit. card. industry, caliber. sniper. rifle (0.04) 



Figure 5: Congressl09 topics, summarized as their top-five terms by lift - Okj over empirical term probability - 
and ordered by usage proportion (column- means of fi, which are included in parentheses). The image plot has 
cells shaded by magnitude of uoik^ with Republicans in red and Democrats in blue. 



for congressl09, while convergence under VEM takes 
many times longer. Gibbs sampling is quickly able 
to find a neighborhood of decent posterior probability, 
such that it performs well relative to VEM but worse 
than the MAP estimators. 

Finally, to illustrate the analysis facilitated by these 
models. Figure [5] offers a brief summary of the 
congressl09 example. Top phrases from each topic 
are presented after ranking by term-lift, O^j/qj where 
qj = ^ij/ X]^=i and the image plots party 

segregation across topics. Although language cluster- 
ing is variously ideological, geographical, and event 
driven, some topics appear strongly partisan (e.g., 3 
for Republicans and 4 for Democrats). Top-lift terms 
in the weSthere example show similar variety, with 
some topics motivated by quality, value or service: 

1. anoth.minut, flag. down, over.minut, wait. over, arriv. after (0.07) 

2. great. place, food. great, place. eat, well. worth, great. price (0.06) 

while others are associated with specific styles of food: 

9. Chicago. style, crust. pizza, thin. crust, pizza. place, deep. dish (0.05) 

11. mexican.food, mexican. restaur, authent.mexican, best.mexican (0.05). 

In both examples, the model provides massive dimen- 
sion reduction (from 1000 to 12 and from 2804 to 20) 
by replacing individual phrases with topic weights. 



6 Discussion 

Results from Section 5 offer some general support for 
our methodology. In model selection, the marginal 
likelihood approximation appears to provide for effi- 
cient data-driven selection of the number of latent top- 
ics. Since a default approach to choosing K has been 
thus far absent from the literature, this should be of 
use in the practice of topic modeling. We note that 
the same block-diagonal Laplace approximation can 
be applied as a basis for inference and fast posterior 
interval calculations. Dispersion shows potential for 
measuring goodness-of-fit, but its role is complicated 
by bias and alternative sources of overdispersion. 

We were pleased to find that the efficiency of joint 
MAP estimation did not lead to lower quality fit, but 
rather uniformly met or outperformed alternative es- 
timates. The simple algorithm of this paper is also 
straightforward to scale for large-data analyses. In a 
crucial step, the independent updates for each 
can be processed in parallel; our experience is that this 
allows fitting 20 or more topics to hundreds of thou- 
sands of documents and tens of thousands of unique 
terms in less than ten minutes on a common desktop. 



M. A. Taddy 
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